Learning Paraphrastic Sentence Embeddings from Back-Translated Bitext
Authors
Abstract
We consider the problem of learning general-purpose, paraphrastic sentence embeddings in the setting of Wieting et al. (2016b). We use neural machine translation to generate sentential paraphrases via back-translation of bilingual sentence pairs. We evaluate the paraphrase pairs by their ability to serve as training data for learning paraphrastic sentence embeddings. We find that the data quality is stronger than prior work based on bitext and on par with manually-written English paraphrase pairs, with the advantage that our approach can scale up to generate large training sets for many languages and domains. We experiment with several language pairs and data sources, and develop a variety of data filtering techniques. In the process, we explore how neural machine translation output differs from human-written sentences, finding clear differences in length, the amount of repetition, and the use of rare words.
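The core idea above, pairing the human English side of bitext with an NMT back-translation of the foreign side, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `translate_to_en` is a stand-in for a trained NMT system, and the length-ratio filter is just one example of the kind of data-filtering heuristic the abstract mentions.

```python
def paraphrases_from_bitext(bitext, translate_to_en, max_len_ratio=1.5):
    """Given (foreign, english_reference) pairs, back-translate the foreign
    side and pair the NMT output with the human reference, yielding
    English-English paraphrase pairs. A simple character length-ratio
    filter drops degenerate outputs (illustrative heuristic only)."""
    pairs = []
    for foreign, reference in bitext:
        candidate = translate_to_en(foreign)
        if candidate == reference:
            continue  # identical output carries no paraphrase signal
        longer = max(len(candidate), len(reference))
        shorter = max(1, min(len(candidate), len(reference)))
        if longer / shorter <= max_len_ratio:
            pairs.append((reference, candidate))
    return pairs
```

In practice the translation step would call a trained sequence-to-sequence model, and filtering would combine several signals (length, repetition, rare-word rate) rather than a single ratio.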
Similar resources
Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations
We extend the work of Wieting et al. (2017), back-translating a large parallel corpus to produce more than 51 million English-English sentential paraphrase pairs in a dataset we call PARANMT-50M. We find that this corpus covers many domains and styles of text, in addition to being rich in paraphrases with different sentence structure, and we release it to the community...
Revisiting Recurrent Networks for Paraphrastic Sentence Embeddings
We consider the problem of learning general-purpose, paraphrastic sentence embeddings, revisiting the setting of Wieting et al. (2016b). While they found LSTM recurrent networks to underperform word averaging, we present several developments that together produce the opposite conclusion. These include training on sentence pairs rather than phrase pairs, averaging states to represent sequences, ...
Towards Universal Paraphrastic Sentence Embeddings
We consider the problem of learning general-purpose, paraphrastic sentence embeddings based on supervision from the Paraphrase Database (Ganitkevitch et al., 2013). We compare six compositional architectures, evaluating them on annotated textual similarity datasets drawn both from the same distribution as the training data and from a wide range of other domains. We find that the most complex ar...
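Among the compositional architectures compared in this line of work, the simplest is word averaging: a sentence embedding is the mean of its word embeddings, and similarity is measured by cosine. The sketch below uses hand-made toy vectors purely for illustration; real systems use embeddings trained on paraphrase pairs.

```python
import numpy as np

def sentence_embedding(sentence, word_vectors):
    """Average the embeddings of in-vocabulary words
    (the word-averaging encoder, with toy vectors)."""
    vecs = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

With trained paraphrastic embeddings, paraphrases like "good movie" and "great film" land near each other under this metric despite sharing no words.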
Paraphrastic Sentence Compression with a Character-based Metric: Tightening without Deletion
We present a substitution-only approach to sentence compression which “tightens” a sentence by reducing its character length. Replacing phrases with shorter paraphrases yields paraphrastic compressions as short as 60% of the original length. In support of this task, we introduce a novel technique for re-ranking paraphrases extracted from bilingual corpora. At high compression rates, paraphrasti...
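The substitution-only idea can be sketched with a greedy variant: replace phrases with shorter paraphrases from a table, never deleting content outright. This is an illustrative simplification; the paper re-ranks candidate paraphrases with a character-based metric rather than applying them greedily, and the table entries here are made up.

```python
def compress(sentence, paraphrase_table):
    """Substitution-only compression: swap phrases for shorter paraphrases,
    applying the largest character savings first (greedy sketch)."""
    out = sentence
    for phrase, shorter in sorted(paraphrase_table.items(),
                                  key=lambda kv: len(kv[1]) - len(kv[0])):
        if len(shorter) < len(phrase):
            out = out.replace(phrase, shorter)
    return out
```

For example, a table mapping "in the event that" to "if" tightens a sentence without removing any of its propositional content.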
LIPN-IIMAS at SemEval-2017 Task 1: Subword Embeddings, Attention Recurrent Neural Networks and Cross Word Alignment for Semantic Textual Similarity
In this paper we report our attempt to use, on the one hand, state-of-the-art neural approaches proposed for measuring Semantic Textual Similarity (STS). On the other hand, we propose an unsupervised cross-word alignment approach, which is linguistically motivated. The neural approaches proposed herein are divided into two main stages. The first stage deals with constructing neural word e...